VAST Challenge 2021: Applying Visual Analytics to Gain Insights
This assignment is based on the VAST Challenge 2021, for which Mini-Challenge 2 (MC2) was selected from the three mini-challenges to be solved through visual analytics. The fictitious scenario centres on GAStech, a Tethys-based natural gas production company operating in the country of Kronos, where it has made remarkable profits and built a strong relationship with the government. However, GAStech has not been as successful in demonstrating environmental stewardship. Several employees went missing in January 2014 following a successful initial public offering. It was suspected that the disappearance might be associated with an organisation known as the Protectors of Kronos (POK), but things may not be so simple. One has therefore been tasked to assist law enforcement from Kronos and Tethys in solving the assigned task via visual analytics.
The main task is to identify suspicious patterns of behaviour from the transactions made by GASTech employees. One must cope with uncertainties that result from missing, conflicting, and imperfect data to make recommendations for further investigation. There are four main CSV files together with Kronos geospatial data (e.g. .tif and .shp files). The data covers the 14 days, 6 to 19 January 2014, prior to the disappearance. As an assignment constraint, only libraries from the R programming language can be used to complete the analytics. The CSV files are (1) credit card transactions, (2) loyalty card transactions, (3) geospatial tracking data for company cars, which was collected without the employees' knowledge, and (4) car assignments. The relevant questions to be answered are:
Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies? Please limit your answer to 8 images and 300 words.
Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find? Please limit your answer to 8 images and 500 words.
Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data? Please limit your answer to 8 images and 500 words.
Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships. Please limit your response to 8 images and 500 words.
Do you see evidence of suspicious activity? Identify 1 to 10 locations where you believe the suspicious activity is occurring, and why. Please limit your response to 10 images and 500 words.
The investigation will be done in 3 main steps:
Data Preparation. This requires data cleaning and transformation to aid the data exploration. The main focus is to establish the data points required to link up the three main files: geospatial tracking, credit card transactions and loyalty card transactions.
Exploratory Data Analysis (EDA). Explain the various R libraries that would be used to build visualisation tools to aid the EDA and discover the insights required to answer the questions.
Exploration Sharing. Storytelling of the discerned insights.
The credit and loyalty card CSV files contain the transaction timestamp, location, price and card number. The timestamps in the two files were in different formats, e.g. %m/%d/%Y vs %m/%d/%y, and the credit card records carried both date and time whereas the loyalty card records carried only the date. Based on these observations, the required data wrangling was performed to resolve the datetime formats (i.e. using lubridate and clock) and to set the card numbers as character. In addition, date (i.e. Date; the timestamp in the loyalty card file was renamed to date), weekday and hour columns were created for the subsequent EDA. Code chunk and glimpse of the prepared credit card dataset:
Rows: 1,490
Columns: 8
$ timestamp <dttm> 2014-01-06 08:16:00, 2014-01-06 13:27:00, 2014-0…
$ date <date> 2014-01-06, 2014-01-06, 2014-01-07, 2014-01-07, …
$ day <dbl> 6, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 11, 12, 12, 12,…
$ hr <dbl> 8, 13, 7, 13, 8, 13, 20, 8, 13, 20, 8, 20, 13, 15…
$ location <chr> "Brew Served", "Abila Zacharo", "Brew Served", "K…
$ price <dbl> 14.97, 50.14, 11.92, 45.05, 7.26, 50.36, 62.20, 1…
$ last4ccnum <chr> "1286", "1286", "1286", "1286", "1286", "1286", "…
$ wkday <ord> Monday, Monday, Tuesday, Tuesday, Wednesday, Wedn…
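The wrangling described above can be sketched as follows. This is a minimal illustration, not the exact chunk used; the file paths and raw column names are assumptions.

```r
library(tidyverse)
library(lubridate)

# Credit card file: timestamp carries both date and time (assumed %m/%d/%Y %H:%M)
cc_data <- read_csv("data/cc_data.csv") %>%
  mutate(
    timestamp  = mdy_hm(timestamp),
    date       = as_date(timestamp),
    day        = day(timestamp),
    hr         = hour(timestamp),
    last4ccnum = as.character(last4ccnum),            # card number kept as character
    wkday      = wday(timestamp, label = TRUE, abbr = FALSE)
  )

# Loyalty card file: only a date is recorded, so rename timestamp to date
loyalty_data <- read_csv("data/loyalty_data.csv") %>%
  rename(date = timestamp) %>%
  mutate(
    date       = mdy(date),
    loyaltynum = as.character(loyaltynum)
  )
```

Keeping the card numbers as character avoids losing leading digits or having them treated as quantities in later joins and plots.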
As previously, the necessary conversion of the Timestamp and id variables to the proper data types was performed. To ease the mapping of locations to geospatial values, the coordinate values were rounded to four decimal places. The main focus was to determine the trips within the geospatial tracking dataset for each vehicle ID. A trip was defined as continuous travel from point A to point B with a time lag of less than 300 seconds between consecutive GPS points; the 300-second threshold was chosen arbitrarily after screening through the dataset. With the trips formed, one could determine the travel locations for each vehicle ID over time.
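The trip definition above amounts to starting a new trip whenever the gap to the previous GPS ping for the same vehicle exceeds 300 seconds. A sketch of that logic, assuming the tracking data has columns id, Timestamp, lat and long:

```r
library(tidyverse)

gps <- gps %>%
  arrange(id, Timestamp) %>%
  group_by(id) %>%
  mutate(
    # seconds since the previous ping of the same vehicle (NA for the first ping)
    lag_secs = as.numeric(difftime(Timestamp, lag(Timestamp), units = "secs")),
    # a new trip starts at the first ping and at every gap longer than 300 s
    trip_id  = cumsum(is.na(lag_secs) | lag_secs > 300)
  ) %>%
  ungroup()
```

The cumulative sum over the "gap detected" flag gives each vehicle a running trip counter, so the first and last rows of each trip_id group are the trip's origin and destination.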
Code chunk for determining the trips within the days for each vehicle ID. Referring to the company employee records in Mini-Challenge 1, GASTech has 54 staff. However, unique counts on the credit and loyalty card numbers show 55 and 54 respectively. This suggests that one of the staff could hold two credit cards. Moreover, a unique count on the transacted locations in each dataset shows 34 and 33 locations for the credit and loyalty cards respectively. The additional location in the credit card dataset was "Daily Dealz", with a transaction amount of 2.01.
To merge both datasets, the common set was determined via an inner join on date, location and price. The inner join yielded 1087 observations, while the anti joins yielded 409 and 311 disjoint observations for the credit and loyalty card transactions respectively. The inner join created 6 extra observations (1392 loyalty data points minus the 311 anti-join rows = 1081; equally, 1490 credit data points minus the 409 anti-join rows = 1081) because some observations shared identical values on the join keys. These duplicated observations were identified and deleted. It was also validated that the inner join contained the full set of credit and loyalty card numbers, and it was therefore used to determine the unique pairing of credit to loyalty cards. In addition, the inner join did not contain Kronos Mart or Daily Dealz as transaction locations, implying that transactions at these two locations were likely made with only one of the two cards.
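The join logic above can be sketched with dplyr as follows; this assumes the prepared data frames cc_data and loyalty_data from the earlier wrangling, and uses distinct() as one way of dropping the duplicated rows created by the many-to-many key matches.

```r
library(dplyr)

join_keys <- c("date", "location", "price")

# Common set: transactions seen on both cards
both_cards <- inner_join(cc_data, loyalty_data, by = join_keys) %>%
  distinct(date, location, price, last4ccnum, loyaltynum, .keep_all = TRUE)

# Disjoint sets: transactions seen on one card only
cc_only      <- anti_join(cc_data, loyalty_data, by = join_keys)
loyalty_only <- anti_join(loyalty_data, cc_data, by = join_keys)
```

Counting rows in the three results is a quick sanity check against the figures quoted above (1087 before de-duplication, 409 and 311 for the anti joins).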
Code chunk for the inner join and finding distinct pairs. To determine the pairing, one needs the number of distinct credit-loyalty pairs for a fixed credit card number, i.e. X1-Y1, X2-Y2, etc. Logically, there should be only one distinct pairing between the two cards, assuming each holder has one of each. However, it was discerned that some credit cards have more than one distinct pair, e.g. X1-Y1 and X1-Y2. This could be because the credit card holder has more than one loyalty card, or vice versa. To visualise the relationship, igraph was used to build a bipartite graph with credit and loyalty cards as nodes and transactions as edges; edge weight was the number of transactions for each distinct credit-loyalty pair. To minimise clutter and surface the credit cards with more than one distinct pair, the dataset was filtered to distinct pairing counts >= 2 before grouping by both cards. Annotations were inked onto the igraph output; the code chunk and bipartite graph are shown below.
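A minimal sketch of the bipartite graph construction, assuming a data frame of joint transactions with columns last4ccnum and loyaltynum (the filtering to pairs of interest is omitted here):

```r
library(dplyr)
library(igraph)

# Edge list: one row per distinct credit-loyalty pair,
# weighted by the number of joint transactions
pairs <- both_cards %>%
  count(last4ccnum, loyaltynum, name = "weight")

g <- graph_from_data_frame(pairs, directed = FALSE)

# igraph's bipartite layout needs a logical 'type' per vertex:
# FALSE = credit card side, TRUE = loyalty card side
V(g)$type  <- V(g)$name %in% pairs$loyaltynum
V(g)$color <- ifelse(V(g)$type, "pink", "orange")

plot(g,
     layout     = layout_as_bipartite(g),
     edge.width = E(g)$weight / 5,
     vertex.size = 8)
```

Credit cards with more than one incident edge stand out immediately, which is exactly the multi-pairing anomaly being hunted for.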
Figure 1: Bipartite Graph of Credit(Orange) and Loyalty(Pink) Cards.
From the bipartite graph, loyalty card L6267 was used by both credit cards 6691 and 6899. That could account for the shortfall of one loyalty card noted in Section 4. There should be some form of relationship between the holders of credit cards 1286 and 9241, given the high usage of the latter's loyalty card by the former. Nonetheless, 1286 will be tagged to L3572. Moving on, the credit card was used as the anchoring point for exploration, with its pairing to the loyalty card based on the relationships established here. The established pairings were saved in a CSV file for future use.
The full dataset for both cards was created using a full join on both datasets, removing the duplicate observations identified above. The mapping of credit cards to loyalty cards was then applied to the full dataset (inclusive of loyalty-card-only transactions). In addition, extra columns such as cardtype (transaction with both cards, credit card only, or loyalty card only) were created to support downstream visualisation. The code chunk and a glimpse of the dataset are given below.
Rows: 1,801
Columns: 13
$ timestamp <dttm> 2014-01-06 08:16:00, 2014-01-06 13:27:00, 2014-0…
$ date <date> 2014-01-06, 2014-01-06, 2014-01-07, 2014-01-07, …
$ day <int> 6, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 11, 12, 12, 12,…
$ hr <int> 8, 13, 7, 13, 8, 13, 20, 8, 13, 20, 8, 20, 13, 15…
$ location <chr> "Brew Served", "Abila Zacharo", "Brew Served", "K…
$ price <dbl> 14.97, 50.14, 11.92, 45.05, 7.26, 50.36, 62.20, 1…
$ last4ccnum <chr> "1286", "1286", "1286", "1286", "1286", "1286", "…
$ wkday <ord> Monday, Monday, Tuesday, Tuesday, Wednesday, Wedn…
$ loyaltynum <chr> "L3572", "L3288", "L3572", "L3288", "L3572", "L32…
$ ccard <chr> "1286", "1286", "1286", "1286", "1286", "1286", "…
$ cardtype <chr> "both", "both", "both", "both", "both", "both", "…
$ VehID <chr> "22", "22", "22", "22", "22", "22", "22", "22", "…
$ Department <chr> "Security", "Security", "Security", "Security", "…
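A sketch of how the full dataset and the cardtype column can be derived; card_pairing stands for the saved credit-to-loyalty pairing CSV and is a hypothetical name, and the de-duplication step is omitted for brevity.

```r
library(dplyr)

full_data <- full_join(cc_data, loyalty_data,
                       by = c("date", "location", "price")) %>%
  mutate(cardtype = case_when(
    !is.na(last4ccnum) & !is.na(loyaltynum) ~ "both",    # matched on both cards
    !is.na(last4ccnum)                      ~ "credit",  # credit card only
    TRUE                                    ~ "loyalty"  # loyalty card only
  )) %>%
  # fill in the owning credit card for loyalty-only rows via the saved pairing
  left_join(card_pairing, by = "loyaltynum")
```

Tagging each row with a cardtype makes it trivial to facet or colour later charts by data provenance, which matters when reasoning about possible duplicate transactions.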
The geospatial tracking dataset has 35 unique private company cars and 5 trucks. Since there are 54 GASTech personnel, not everyone can be associated with geospatial tracking data. The only link between the tracking and transaction datasets is the GPS values of the transaction locations. These values were determined by exploring discerned patterns that help drill down into the GPS values in the tracking data. One should progressively hunt for patterns in smaller subsets, such as locations or credit cards with low transaction volume, vehicle IDs with fewer trips, and weekend activities, and then expand to larger subsets. Numerous visualisation aids were used to help triangulate the location GPS values, as follows:
Plot of Vehicle ID 33 with Trip 16:
Progressively explore from smaller to larger data subsets while compiling the GPS values for the locations of interest and tagging them onto the geospatial tracking data. For company trucks, one could focus on transactions at industrial locations such as Abila Airport, Kronos Pipe and Irrigation, Carlyle Chemical Inc, Maximum Iron and Steel, etc., and build up from there. With the names tagged, the geospatial tracking dataset was streamlined to only the start and end points of each trip for each vehicle ID. See the code chunk below for tagging location names onto the geospatial tracking data based on a manually compiled name-to-GPS-values CSV file.
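The tagging step can be sketched as a join against the manually compiled lookup; "locations.csv" and its columns (name, lat, long) are assumed names, and exact matching on four-decimal-place coordinates is a simplification (a small distance threshold would be more robust against GPS jitter).

```r
library(dplyr)

# Manually compiled lookup of location name to GPS coordinates
locations <- readr::read_csv("data/locations.csv")

gps_tagged <- gps %>%
  mutate(lat  = round(lat, 4),
         long = round(long, 4)) %>%
  left_join(locations, by = c("lat", "long")) %>%
  # fall back to a generic label for points not yet identified
  mutate(location = coalesce(name, paste("ID", id)))
```

The fallback label keeps unidentified pings visible in downstream plots so they can be hunted down iteratively rather than silently dropped.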
Manually, one could then start to triangulate the matching of credit cards to vehicle IDs using the following visualisation aid:
Rows: 7,531
Columns: 3
$ id <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", …
$ timestamp <dttm> 2014-01-06 07:20:01, 2014-01-06 07:22:04, 2014-01…
$ location <chr> "ID 1", "Hallowed Grounds", "Hallowed Grounds", "G…
The columns were then prepared in the data format appropriate for timevis, ensuring there were no duplicates.
Rows: 176
Columns: 4
$ cid <chr> "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5…
$ start <dttm> 2014-01-06 07:59:01, 2014-01-06 08:00:34, 2014-01-0…
$ content <chr> "ID 5", "Jack Beans", "Jack Beans", "GASTech", "GAST…
$ group <chr> "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5…
timevis allows one to view and compare activities as a time series. For example, the plot below shows that the activities of credit card 1286 are a good fit to the geospatial tracking data of vehicle ID 22: one can check whether the transacted locations tally with the GPS locations along the transaction timeline. However, this dataset excludes loyalty-card-only transactions, which lack a time dimension. Nonetheless, one can still use the tracking data to verify whether the card holder was indeed present at the location of a loyalty-card-only transaction. One also needs to switch between tmap, tidygraph, datatable and timevis to find the most appropriate visualisation for matching owners of vehicle IDs to credit cards.
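A sketch of a timevis call on the prepared data frame (start, content, group as glimpsed above); the group labels "cc_1286" and "veh_22" are hypothetical values used here for illustration.

```r
library(timevis)
library(dplyr)

# Compare one credit card's transactions against one vehicle's trips
tv_subset <- tv_data %>%
  filter(group %in% c("cc_1286", "veh_22")) %>%
  mutate(id = row_number())   # timevis items need unique ids

timevis(
  data   = select(tv_subset, id, content, start, group),
  groups = data.frame(id      = c("cc_1286", "veh_22"),
                      content = c("Credit card 1286", "Vehicle ID 22"))
)
```

Rendering the two groups as parallel lanes makes it easy to eyeball whether each transaction timestamp falls inside a stop of the candidate vehicle.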
Code chunk and datatable for the geospatial tracking data tagged with location names.
During the EDA, it was discerned that there were locations within the geospatial tracking data not associated with the transaction locations, such as the GASTech office and the residences of the respective vehicle IDs. To determine the residence for each vehicle ID, one can find the start and end points of each day from the geospatial tracking data and then determine the most likely residential location, thereafter mapping the residence name back onto the dataset. For those remaining unidentified, tmap was used as a visual aid and an arbitrary name was assigned to the location. See the code chunk for determining home locations. To support the EDA further, the locations were tagged with one of five types in a new column "loctype": O = office, T = transaction location, N = non-transaction location, H = home, I = industrial location.
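The home-location heuristic described above can be sketched as follows: take the first and last GPS ping of each vehicle's day, then pick the coordinate that recurs most often. Column names (id, timestamp, lat, long) are assumed.

```r
library(dplyr)

home_guess <- gps_tagged %>%
  arrange(id, timestamp) %>%
  group_by(id, date = as.Date(timestamp)) %>%
  slice(c(1, n())) %>%              # first and last ping of each day
  ungroup() %>%
  count(id, lat, long, sort = TRUE) %>%
  group_by(id) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%  # most frequent start/end point
  ungroup()
```

Vehicles parked overnight somewhere other than a residence (or with missing days) will confound this heuristic, which is why the remaining cases were resolved visually with tmap.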
To explore relationships among the employees, tidygraph, ggraph, igraph and visNetwork would be used. Nonetheless, considerable data preparation was needed to support the visualisation. A sample visIgraph output is appended below.
Referring to the chart below, the most popular location was Katerina's Cafe with 254 transactions, versus 211 for Hippokampos. The main anomaly centred on imperfect transaction information. The merged dataset comprised 1081 transactions with both card details and 720 with details from one card only (i.e. 409 credit and 311 loyalty). The 720 were treated as unique transactions since they could not be joined on date, price and location. Due to the lack of time information in the loyalty card transactions, it was difficult to determine whether these were different transactions on the same day or the same transaction with different values recorded by the loyalty card. Since this is an investigation, each card transaction should be taken as an independent, valid transaction until proven otherwise; thus, there might be duplicated transaction counts. Nonetheless, under a quick assessment assuming duplication, Katerina's Cafe would remain the most popular location, with a potential 28 duplicated counts. The table below provides information on single-card transactions only, which can be verified against the chart. The dataset used has the credit-loyalty card relationships mapped out, as explained in a later section.
It was assumed that the credit card timestamp corresponds to the time of transaction and that the card holder left the location within minutes of transacting. Thus, for Katerina's Cafe, the popular time should be about one hour before the timestamp, i.e. busy during lunch (1200-1400 hrs) and dinner (1700-2000 hrs). Based on the date-time chart below, it was also observed that Katerina's Cafe does not operate at breakfast time. On weekends there were fewer customers at lunch, particularly on Sunday, but more tended to stream in earlier for dinner. On weekdays, the dinner crowd tended to stay longer than on weekends. Nonetheless, most customers left before 2200 hrs. (The time value for loyalty-card-only transactions is taken as null since they have no timestamp.)
Weekday-only transactions could be categorised into two groups, namely industrial versus cafe locations. For the industrial locations, there seemed to be a planned routine for GAStech personnel to visit them. In addition, GAStech personnel visited most of the coffee-related cafes only on weekdays, and in particular during the morning session.
Delving into the transaction times, interesting patterns were detected. For example, Kronos Mart was the only place with transactions after 3 am. A series of cafes such as Bean There Done That and Jack Magical Beans had transactions only at exactly 12 pm, which was assessed to be abnormal (this can be verified with the data table provided above). Going through the tooltips in the time plot, one could also determine the peak hours by transaction volume for any specific location.
It was also observed that most card holders have favourite locations that they patronise very frequently for breakfast, lunch or dinner; for example, credit card holders 4434 and 6816 frequented Brew Served and Guy Gyros. Any deviation from these patterns could therefore potentially connote an anomaly.
From the plot of transaction frequency for the credit cards, one can observe that the transaction activities cluster into three groups. The first group (from 3484 to 8411 in the chart below) has a high frequency of transaction activity that gradually reduces moving down the chart. The second group (from 3492 to 4530) has a medium frequency of transaction activity with a steep reduction curve moving down. The last group, namely 5010, 9152 and 9614, has a relatively low frequency of transaction activity. This chart might subsequently aid the correlation of transaction data with the geospatial tracking information, so that one could associate clusters of transaction activity with vehicle IDs and hence indicate the card holder. Lastly, it is noted that credit card 9551 has a relatively lower number of both-card transactions compared to the rest.
Once the mapping of credit cards to vehicle IDs was done with confidence, the geospatial data would help verify whether the card holder/vehicle owner actually travelled to the transacted location. With that location information, one could possibly gain insight into whether an individual card transaction was made by the card holder/vehicle owner. Nonetheless, GPS has its own inadequacies, such as GPS errors. Moreover, the geospatial tracking provides additional insights on locations an individual travelled to beyond the transacted locations.
Referring to the location types stipulated in Section 5.3, a bar chart on location type is shown below. The code chunk is largely similar to the one above, with some changes in parameters. From